2025-06-10
Question
What happens when multiple units adopt the intervention at different times?
Example
Twenty states adopted COVID-19 vaccine mandates for state employees at different times.
Plot of state employee vaccination mandate timings, U.S. states, June 2021–February 2022
Multiple units, treated at different time points
Multiple time points of observations
Caution
The two-way fixed effects (TWFE) model, with unit effects \(\alpha_i\) and period effects \(\gamma_t\), can accommodate this statistically:
\[ Y_{it} = \alpha_i + \gamma_t + \theta I(X_{it} = 1)+\epsilon_{it} \]
But as written it assumes treatment effect homogeneity across time periods, time-on-treatment, and units: there is a single treatment effect \(\theta\).
Heterogeneity is common in many settings, particularly in epidemiology and when adoption is non-randomized.
Question
What might cause heterogeneity in the effect of a state employee COVID-19 vaccine mandate?
Goodman-Bacon (2021), Figure 1.
This complicates the interpretation of the estimand: it is still an average treatment effect on the treated (ATT), but the average is now weighted, with weights that depend on the number of units at each switch point and the number of periods observed.
Plot of three-unit staggered adoption example with outcomes
Simple average of the effects over treated unit-periods, using all periods:
\[ \theta = \frac{1+1+1+3+3}{5} = 1.8 \]
Using only the first four periods:
\[ \theta = \frac{1+1+3}{3} \approx 1.67 \]
Fitted regression coefficients (X1 is the treatment term), using all periods:
(Intercept) Time Unitk Unitl X1
-1.0 1.0 3.4 1.4 2.0
Using only the first four periods:
(Intercept) Time Unitk Unitl X1
-1.000000 1.000000 3.571429 1.285714 1.857143
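The printed coefficients can be reproduced with a small exact least-squares sketch. The data-generating values below (intercept -1, slope 1 in time, unit effects 4 and 1, treatment effects 1 and 3, adoption in periods 3 and 4) are assumptions chosen because they match the printed output; they are not taken from the slides. The single `Time` coefficient in the output suggests time entered the fit linearly, so the sketch also shows what period dummies would give instead.

```python
from fractions import Fraction

def ols(X, y):
    """Exact OLS coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y], stored as exact fractions.
    A = [[Fraction(sum(r[i] * r[j] for r in X)) for j in range(p)]
         + [Fraction(sum(r[i] * yi for r, yi in zip(X, y)))]
         for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination
        piv = next(r for r in range(c, p) if A[r][c] != 0)
        A[c], A[piv] = A[piv], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(p):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][p] for i in range(p)]

# Hypothetical values, chosen to reproduce the printed coefficients.
FIRST_TREATED = {"U": None, "k": 3, "l": 4}  # first treated period (None = never)
UNIT_EFFECT = {"U": 0, "k": 4, "l": 1}
TREAT_EFFECT = {"U": 0, "k": 1, "l": 3}

def fit(n_periods, period_dummies=False):
    """Coefficients ordered as: intercept, time term(s), unit k, unit l, treatment."""
    X, y = [], []
    for u, g in FIRST_TREATED.items():
        for t in range(1, n_periods + 1):
            d = int(g is not None and t >= g)
            if period_dummies:
                time_cols = [int(t == s) for s in range(2, n_periods + 1)]
            else:
                time_cols = [t]
            X.append([1] + time_cols + [int(u == "k"), int(u == "l"), d])
            y.append(-1 + t + UNIT_EFFECT[u] + TREAT_EFFECT[u] * d)
    return ols(X, y)

print([float(b) for b in fit(5)])              # linear Time term, all periods
print([float(b) for b in fit(4)])              # linear Time term, first four periods
print(float(fit(4, period_dummies=True)[-1]))  # period dummies, first four periods
```

Under these assumed values, the linear-time fits return the two printed coefficient vectors, while the four-period fit with period dummies gives a treatment coefficient of 1.8 rather than 1.857.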
If all units are of the same size/variance, per Goodman-Bacon (2021) Thm. 1:
\[ \begin{gathered} s_{kU} \propto \bar{D}_k (1-\bar{D}_k) \\ s^k_{k \ell} \propto (\bar{D}_k -\bar{D}_\ell) (1-\bar{D}_k) \\ s^\ell_{k \ell} \propto (\bar{D}_k -\bar{D}_\ell) \bar{D}_\ell \end{gathered} \]
where \(\bar{D}\) is the proportion of periods treated.
Using all periods:
\[ \begin{gathered} \bar{D}_k = 3/5, \quad \bar{D}_\ell = 2/5 \\ s_{kU} \propto 6/25, \quad s_{\ell U} \propto 6/25 \\ s^k_{k \ell} \propto 2/25, \quad s^\ell_{k \ell} \propto 2/25 \\ \hat{\theta} = \frac{6 \cdot 1 + 6 \cdot 3 + 2 \cdot 1 + 2 \cdot 3}{6 + 6 + 2 + 2} = 2 \end{gathered} \]
Using only the first four periods:
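\[ \begin{gathered} \bar{D}_k = 2/4, \quad \bar{D}_\ell = 1/4 \\ s_{kU} \propto 4/16, \quad s_{\ell U} \propto 3/16 \\ s^k_{k \ell} \propto 2/16, \quad s^\ell_{k \ell} \propto 1/16 \\ \hat{\theta} = \frac{4 \cdot 1 + 3 \cdot 3 + 2 \cdot 1 + 1 \cdot 3}{4 + 3 + 2 + 1} = 1.8 \end{gathered} \]
This is the estimate from a model with period fixed effects \(\gamma_t\). The 1.857 in the four-period regression output above comes from a fit with a single Time coefficient, that is, a linear time trend rather than period fixed effects.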
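The weighted averages for this three-group example can be checked with a short sketch. It assumes equal-sized groups, so each weight is proportional to the treated-share terms only: \(\bar{D}(1-\bar{D})\) for comparisons with the untreated group, \((\bar{D}_k-\bar{D}_\ell)(1-\bar{D}_k)\) for the early-versus-late comparison, and \((\bar{D}_k-\bar{D}_\ell)\bar{D}_\ell\) for the late-versus-early comparison. The function name and arguments are illustrative.

```python
from fractions import Fraction

def bacon_average(d_k, d_l, effect_k, effect_l):
    """Weighted average of the four 2x2 comparisons for equal-sized groups
    U (never treated), k (early adopter), and l (late adopter)."""
    comparisons = [
        (d_k * (1 - d_k), effect_k),          # k vs. U
        (d_l * (1 - d_l), effect_l),          # l vs. U
        ((d_k - d_l) * (1 - d_k), effect_k),  # k vs. not-yet-treated l
        ((d_k - d_l) * d_l, effect_l),        # l vs. already-treated k
    ]
    total = sum(w for w, _ in comparisons)
    return sum(w * b for w, b in comparisons) / total

all_periods = bacon_average(Fraction(3, 5), Fraction(2, 5), 1, 3)
four_periods = bacon_average(Fraction(2, 4), Fraction(1, 4), 1, 3)
print(all_periods, four_periods)  # 2 9/5
```

With all five periods the weights are proportional to 6, 6, 2, 2; with the first four periods they are proportional to 4, 3, 2, 1, which gives 9/5 = 1.8.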
Goodman-Bacon (2021), Figure 3.
The weights on treatment effects need not be convex (that is, some can be negative) if treatment effects vary over time.
This gives an uninterpretable TWFE estimand, and can even switch the sign of the estimate.
Time-varying effects can be broadly categorized into:
Dynamic treatment effects that depend on how long a unit has been treated, and
Calendar time effects that differ for all units based on the period of observation.
Both can be problematic, although calendar time effects can be handled similarly to unit-varying effects.
Goodman-Bacon (2021), Eqn 15
\[ \operatorname*{plim}_{N \to \infty} \hat{\theta} = \beta^{DD} = VWATT + VWCT - \Delta ATT, \]
where:
\(VWATT\) is the variance-weighted ATT (as in computation above),
\(VWCT\) is the variance-weighted deviation from parallel trends, and
\(\Delta ATT\) is the weighted sum of changes in the treatment effect within timing groups.
Goodman-Bacon (2021), Theorem 1
\[ \hat{\theta} = \sum_{k \neq U} s_{kU} \hat{\beta}^{2 \times 2}_{kU} + \sum_{k \neq U} \sum_{\ell > k} \left[ s^k_{k \ell} \hat{\beta}^{2 \times 2,k}_{k \ell} + s^\ell_{k \ell} \hat{\beta}^{2 \times 2,\ell}_{k \ell} \right] \]
where:
\(\hat{\beta}^{2 \times 2}_{kU}\) is a comparison of timing group \(k\) to untreated \(U\),
\(\hat{\beta}^{2 \times 2,k}_{k \ell}\) is a comparison of an early-treated group \(k\) to a late-treated group \(\ell\), using only the periods before \(\ell\) switches (so \(\ell\) serves as a not-yet-treated control), and
\(\hat{\beta}^{2 \times 2,\ell}_{k \ell}\) is a comparison of a late-treated group \(\ell\) to an early-treated group \(k\), using only the periods after \(k\) switches (so \(k\) serves as an already-treated control).
The TWFE model estimates a weighted average of all 2x2 DID comparisons.
Goodman-Bacon (2021), Figure 6.
We can observe the overall weight given to each treatment timing group, which may be negative if it is more often used as a control than a treated group.
Goodman-Bacon (2021), Figure 7.
Under unit-varying treatment effects, \(VWATT\) may not be a readily interpretable average of effects (and some may get negative weights).
Under any individual deviations from parallel trends (even if the average does not deviate), \(VWCT\) may not be 0.
Under time-varying treatment effects, \(\Delta ATT \neq 0\), which can result in negative weights on some effects.
Warning
Negative weights on some treatment effects can lead to averages outside of the range, or even a changed treatment effect sign.
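A minimal sketch of how a sign change can arise, assuming a balanced panel, one observation per unit-period, and parallel trends. Under those assumptions the TWFE coefficient equals a weighted sum of the cell-level effects, with each treated cell's weight proportional to its double-demeaned treatment indicator. The two-unit, three-period design and the effect sizes below are hypothetical.

```python
from fractions import Fraction

def twfe_cell_weights(D):
    """D maps unit -> list of 0/1 treatment indicators (balanced panel).
    Returns the implied TWFE weight on each treated (unit, period) cell."""
    units = list(D)
    N, T = len(units), len(D[units[0]])
    row = {u: Fraction(sum(D[u]), T) for u in units}
    col = [Fraction(sum(D[u][t] for u in units), N) for t in range(T)]
    grand = Fraction(sum(sum(D[u]) for u in units), N * T)
    # Treatment indicator with unit and period means removed
    resid = {u: [D[u][t] - row[u] - col[t] + grand for t in range(T)]
             for u in units}
    denom = sum(resid[u][t] * D[u][t] for u in units for t in range(T))
    return {(u, t + 1): resid[u][t] / denom
            for u in units for t in range(T) if D[u][t] == 1}

def twfe_estimand(D, effects):
    """Weighted sum of cell-level effects implied by the TWFE weights."""
    weights = twfe_cell_weights(D)
    return sum(weights[cell] * effects[cell] for cell in weights)

# Hypothetical design: no never-treated unit, effect grows with time on treatment.
D = {"early": [0, 1, 1], "late": [0, 0, 1]}
effects = {("early", 2): 1, ("early", 3): 4, ("late", 3): 1}

weights = twfe_cell_weights(D)
print(weights)                    # the already-treated cell gets weight -1/2
print(twfe_estimand(D, effects))  # -1/2, although every effect is positive
```

The early unit's second treated period serves mainly as a control for the late unit's switch, so its (larger) effect is subtracted rather than added.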
One approach to avoiding forbidden comparisons is restricting the observations used to fit the model.
Borusyak et al. (2024) fit the unit and period fixed effects using only untreated (never-treated and not-yet-treated) observations. They then use the fitted model to impute counterfactual outcomes for the treated observations, for comparison with the observed outcomes.
Sun and Abraham (2021) restrict comparisons to a clean “control” group \(C\) that is either never-treated or last-treated. Their regression approach then implicitly weights by population share in each timing group.
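The imputation idea can be sketched as follows: fit unit and period effects on untreated observations only, predict the untreated outcome for each treated observation, and average the differences. This is a simplified illustration in the spirit of that approach, not the authors' implementation; the data are the hypothetical three-unit values used to reproduce the earlier example, and it assumes every unit and every period has at least one untreated observation.

```python
from fractions import Fraction

def ols(X, y):
    """Exact OLS coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    A = [[Fraction(sum(r[i] * r[j] for r in X)) for j in range(p)]
         + [Fraction(sum(r[i] * yi for r, yi in zip(X, y)))]
         for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination
        piv = next(r for r in range(c, p) if A[r][c] != 0)
        A[c], A[piv] = A[piv], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(p):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][p] for i in range(p)]

def imputation_att(Y, D):
    """Y, D: dicts keyed by (unit, period); D is 0/1.
    Returns the average imputed effect and the cell-level effects."""
    units = sorted({u for u, _ in Y})
    periods = sorted({t for _, t in Y})

    def row(u, t):
        # Unit dummies (no intercept) plus period dummies (first period omitted)
        return [int(u == v) for v in units] + [int(t == s) for s in periods[1:]]

    untreated = [key for key in Y if D[key] == 0]
    coef = ols([row(*key) for key in untreated], [Y[key] for key in untreated])
    # Observed outcome minus the imputed untreated outcome, per treated cell
    effects = {key: Y[key] - sum(c * x for c, x in zip(coef, row(*key)))
               for key in Y if D[key] == 1}
    return sum(effects.values()) / len(effects), effects

# Hypothetical three-unit example: k adopts in period 3, l in period 4.
first = {"U": None, "k": 3, "l": 4}
alpha = {"U": 0, "k": 4, "l": 1}
tau = {"U": 0, "k": 1, "l": 3}
D = {(u, t): int(g is not None and t >= g) for u, g in first.items() for t in range(1, 6)}
Y = {(u, t): -1 + t + alpha[u] + tau[u] * D[(u, t)] for (u, t) in D}

att, effects = imputation_att(Y, D)
print(att)  # 9/5
```

Because the cell-level effects are recovered before averaging, the weighting is explicit here: the final line takes an equal-weighted mean over treated unit-periods, and other weights could be substituted.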
Advantages:
Simple to implement
Straightforward interpretations for each treated unit
Disadvantages/Limitations:
Still opaque weighting of unit- or time-varying effects
Throws out potentially valuable information: inefficient
Use only the immediate switching effect. For each time period \(t\) with at least one unit untreated at \(t-1\) and treated at \(t\) and at least one unit untreated at both \(t-1\) and \(t\), compute:
\[ \begin{aligned} \widehat{DID}_{+,t} = \frac{1}{N_{1,0,t}} &\sum_{i:D_{i,t}=1,D_{i,t-1}=0} \left( Y_{i,t} - Y_{i,t-1} \right) \\ &- \frac{1}{N_{0,0,t}} \sum_{i:D_{i,t}=D_{i,t-1}=0} \left( Y_{i,t} - Y_{i,t-1} \right) \end{aligned} \]
Average these switcher estimates across all time periods \(t\), weighted by number of units or individuals.
See de Chaisemartin and d’Haultfoeuille (2020) and de Chaisemartin and d’Haultfoeuille (2023), or crossover estimate in Kennedy-Shaffer et al. (2020).
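The switcher computation can be sketched directly from the formula for \(\widehat{DID}_{+,t}\). This is a minimal illustration that assumes one observation per unit-period and weights each period by its number of switching units; the data are hypothetical values for a three-unit example with adoption in periods 3 and 4.

```python
from fractions import Fraction

def switcher_did(Y, D):
    """Y, D map unit -> list over periods. Averages the period-specific
    switcher DID estimates, weighted by the number of switching units."""
    T = len(next(iter(Y.values())))
    weighted_sum, n_switchers = Fraction(0), 0
    for t in range(1, T):
        switchers = [u for u in D if D[u][t - 1] == 0 and D[u][t] == 1]
        stayers = [u for u in D if D[u][t - 1] == 0 and D[u][t] == 0]
        if not switchers or not stayers:
            continue  # no clean comparison available in this period

        def mean_change(group):
            return Fraction(sum(Y[u][t] - Y[u][t - 1] for u in group)) / len(group)

        did_t = mean_change(switchers) - mean_change(stayers)
        weighted_sum += len(switchers) * did_t
        n_switchers += len(switchers)
    return weighted_sum / n_switchers

# Hypothetical data: unit k has effect 1 from period 3, unit l has effect 3 from period 4.
D = {"U": [0, 0, 0, 0, 0], "k": [0, 0, 1, 1, 1], "l": [0, 0, 0, 1, 1]}
Y = {"U": [0, 1, 2, 3, 4], "k": [4, 5, 7, 8, 9], "l": [1, 2, 3, 7, 8]}

print(switcher_did(Y, D))  # 2
```

Each unit contributes only its first treated period, so the result (2) is the equal-weighted mean of the two immediate effects, 1 and 3, regardless of how many later periods are observed.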
Advantages:
All switching units (except possibly last) are included equally
Can restrict to only clean comparisons
Avoids dynamic treatment effects
Disadvantages/Limitations:
Throws out a lot of information: inefficient
Does not capture the full scope of treatment effects
Need to be very careful about wash-out periods and interpretations
Callaway and Sant’Anna (2021) propose estimating \(\widehat{ATT}_{g,t}\) for each timing group \(g\) and period \(t\) using a non-parametric comparison to the last pre-treatment period; the suggested approaches are inverse probability weighting (IPW), outcome regression (OR), and doubly robust (DR) estimation.
Then summarize to an overall average effect weighted by \(w_{g,t}\):
\[ \theta = \sum_g \sum_{t=2}^T w_{g,t} ATT_{g,t}. \]
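A stripped-down sketch of the group-time building block, assuming no covariates (so none of the IPW, OR, or DR adjustments), a never-treated comparison group, and adoption no earlier than period 2. Each estimate is the cohort's change from its last pre-treatment period minus the same change among never-treated units. The equal weights \(w_{g,t}\) in the final step are one illustrative choice, and the data are hypothetical.

```python
from fractions import Fraction

def att_gt(Y, first_treated, g, t):
    """Y[u] lists outcomes for periods 1..T; first_treated[u] is the
    adoption period or None. Requires g >= 2 so that period g - 1 exists."""
    def mean_change(group):
        # Change from the last pre-treatment period (g - 1) to period t
        return Fraction(sum(Y[u][t - 1] - Y[u][g - 2] for u in group)) / len(group)

    cohort = [u for u, a in first_treated.items() if a == g]
    never = [u for u, a in first_treated.items() if a is None]
    return mean_change(cohort) - mean_change(never)

# Hypothetical data: unit k has effect 1 from period 3, unit l has effect 3 from period 4.
first_treated = {"U": None, "k": 3, "l": 4}
Y = {"U": [0, 1, 2, 3, 4], "k": [4, 5, 7, 8, 9], "l": [1, 2, 3, 7, 8]}
T = 5

cells = {(g, t): att_gt(Y, first_treated, g, t)
         for g in (3, 4) for t in range(g, T + 1)}
overall = sum(cells.values()) / len(cells)  # equal weight on each post-treatment cell
print(cells)
print(overall)  # 9/5
```

Keeping the group-time estimates separate makes the aggregation step explicit: changing the weights changes the estimand, not the underlying comparisons.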
Let
\[ ATT(g,t) = E[Y_{it}(g) - Y_{it}(0) \mid G_i = g], \]
the group-time ATT in period \(t\) for units first treated in period \(g\) (\(G_i = g\)), compared to if they had never been treated (or were not yet treated by period \(t\)).
Many solutions boil down to considering which group-time ATTs should be included in the estimand, how they differ, and how to weight them.
TWFE assumes \(ATT(g,t) = \theta\) for all \(g \le t\).
\[ Y_{it} = \alpha_i + \gamma_t + \sum_{k \neq 0} \delta_{k} I(K_{it} = k)+\epsilon_{it}, \]
where \(K_{it}\) is the lead/lag for unit \(i\) in period \(t\) (e.g., \(K_{it} = 1\) in the first exposed period).
See Borusyak and Jaravel (2018) and Borusyak et al. (2024). Captures time-on-treatment heterogeneity.
Also useful to test for “pre-trends” in single intervention time setting.
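The lead/lag variable can be constructed from each unit's adoption period. The sketch below assumes the convention stated above, \(K_{it} = 1\) in the first exposed period, so that \(K_{it} = 0\) is the last pre-treatment period and serves as the omitted reference. Treating never-treated units as having all indicators equal to zero is an additional assumption, and the function names are illustrative.

```python
def event_time(t, first_treated):
    """Lead/lag K_it: 1 in the first exposed period, 0 in the last
    pre-treatment period, negative before that. None if never treated."""
    if first_treated is None:
        return None
    return t - first_treated + 1

def event_dummies(t, first_treated, leads_lags):
    """Indicator columns I(K_it = k) for each k in leads_lags except the
    reference k = 0. Never-treated units get all zeros."""
    k_it = event_time(t, first_treated)
    return [int(k_it == k) for k in leads_lags if k != 0]

# A unit adopting in period 3, observed in periods 1 to 5
print([event_time(t, 3) for t in range(1, 6)])  # [-1, 0, 1, 2, 3]
print(event_dummies(4, 3, range(-1, 4)))        # [0, 0, 1, 0]
```

Coefficients on the negative values of \(k\) are the ones inspected when checking for pre-trends.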
We can account for timing cohort heterogeneity as well by further allowing the effect to vary by adoption timing group (\(G_i\)):
\[ Y_{it} = \alpha_i + \gamma_t + \sum_g \sum_{k \neq 0} \delta_{g,k} I(G_i = g) I(K_{it} = k) + \epsilon_{it} \]
Various methods use this approach, and differ in which comparisons/observations they allow and how they combine results. This implies different assumptions and bias-variance tradeoffs.
More generally, one can weight across all 2x2 DID comparisons, with weights chosen to target a specific estimand and then to minimize variance.
\[ \hat{\theta} = \sum_{i,i',t,t'} w_{i,i',t,t'} \left[ \left( Y_{i,t'} - Y_{i,t} \right) - \left( Y_{i',t'} - Y_{i',t} \right) \right] \]
Review/survey papers:
The identifying assumptions, parallel trends in particular, are complicated by staggered adoption and the longer time frames implied by panel data. Recent work has focused on how to interpret and test these assumptions and how to incorporate time-varying covariates.
The no-anticipation (or known/limited anticipation) assumption still must hold, as must the no-spillover assumption.
All of these approaches change the precise specification of the estimand as well: the ATT must be interpreted in terms of the included time periods, lags, and units, and how they are weighted.
Important
It’s easy to ignore the fundamentals when using the more advanced methods. Consider the validity of the data, the question being asked, and the feasibility of the effect.
Consider data source carefully
Think about possible heterogeneities and desired estimand
Use graphical displays and diagnostics to assess possible biases and trade-offs
Consider multiple estimation methods for robustness to different assumptions
Pre-specify, and explore with appropriate caveats